EQUIVALENCE OF CONCENTRATION INEQUALITIES FOR 
LINEAR AND NON-LINEAR FUNCTIONS 



T. J. SULLIVAN AND H. OWHADI 

Abstract. We consider a random variable X that takes values in a (possibly 
infinite-dimensional) topological vector space X. We show that, with respect 
to an appropriate "normal distance" on X, concentration inequalities for linear 
and non-linear functions of X are equivalent. This normal distance corresponds 
naturally to the concentration rate in classical concentration results such as 
Gaussian concentration and concentration on the Euclidean and Hamming 
cubes. Under suitable assumptions on the roundness of the sets of interest, 
the concentration inequalities so obtained are asymptotically optimal in the 
high-dimensional limit. 

1. Introduction 

It is by now almost classical that smooth enough convex functions enjoy good 
concentration properties; see e.g. [15] for surveys of the literature. It 
is also known that convexity can be neglected in the Gaussian case and that the 
smoothness assumptions are not essential and can be replaced, for instance, with 
bounded martingale differences; see e.g. [20] [21] and also [29]. 

A common feature of many concentration results is that an appropriate notion of 
distance is needed, e.g. Talagrand's convex distance [27]. In this paper, a notion of 
"normal distance" on a topological vector space X is introduced through a technique 
commonly used in large deviations theory, Chernoff bounding, i.e. estimating the 
measure of a set by using a containing half-space. Although simple, this method 
leads to a notion of distance that is in some sense "natural" with respect to the 
duality structure on X . Remarkably, with respect to this distance, concentration 
inequalities on the tails of linear, convex, quasiconvex and non-linear functions on 
X are mutually equivalent. 

Concentration of measure is based on a simple but non-trivial observation
originally due to Lévy [17]: in a high-dimensional probability space, "nearly
all" the probability mass lies close to any set with measure at least
$\tfrac{1}{2}$; put another way, functions of many independent variables with
small sensitivity to each individual input are very nearly constant. A typical
concentration inequality is of the form

\[ \mathbb{P}[|f(X) - m| \geq r] \leq C_1 \exp(-C_2 r^2), \tag{1.1} \]

where $f$ is a suitably well-behaved function, $X$ is a random variable such
that the push-forward measure $(f \circ X)_* \mathbb{P}$ has some concentration
property, and $m$ is either the mean value $\mathbb{E}[f(X)]$ or a median value
$\mathbb{M}[f(X)]$; sometimes the control is one-sided, and the absolute value
in (1.1) is omitted. A notable feature of this paper is that it provides
concentration inequalities with $m = f(\mathbb{E}[X])$.

The key property of the normal distance of this paper is contained in the following 
portmanteau theorem for the equivalence of various concentration inequalities with 
respect to normal distance: 

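Theorem 1.1. Let $\mathcal{X}$ be a real topological vector space and
$\mathcal{X}^*$ its continuous dual space. Let
$\Psi \colon \mathcal{X}^* \to [0, +\infty]$ be positively homogeneous of
degree one. Define the $\Psi$-normal distance from $x \in \mathcal{X}$ to
$A \subseteq \mathcal{X}$ by

\[ d_{\perp,\Psi}(x, A) := \sup \left\{ \frac{\langle \nu, x - p \rangle_+}{\Psi(\nu)} \,\middle|\, \begin{array}{l} p \in \mathcal{X} \text{ and } \nu \in \mathcal{X}^* \text{ such that,} \\ \text{for all } a \in A,\ \langle \nu, a \rangle \leq \langle \nu, p \rangle \end{array} \right\}, \]

with the convention that 0/0 = 0. Then the following statements about any
random variable $X$ that takes values in $\mathcal{X}$ are equivalent:

(i) for every closed half-space
$\mathbb{H}_{p,\nu} := \{ x \in \mathcal{X} \mid \langle \nu, x - p \rangle \leq 0 \} \subseteq \mathcal{X}$,
where $p \in \mathcal{X}$ and $\nu \in \mathcal{X}^*$,

\[ \mathbb{P}[X \in \mathbb{H}_{p,\nu}] \leq \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], \mathbb{H}_{p,\nu})^2}{2} \right); \]

(ii) for every convex set $K \subseteq \mathcal{X}$,

\[ \mathbb{P}[X \in K] \leq \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], K)^2}{2} \right); \]

(iii) for every measurable $A \subseteq \mathcal{X}$,

\[ \mathbb{P}[X \in A] \leq \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], A)^2}{2} \right); \]

(iv) for every measurable $f \colon \mathcal{X} \to \mathbb{R} \cup \{\pm\infty\}$
and every $\theta \in \mathbb{R} \cup \{\pm\infty\}$,

\[ \mathbb{P}[f(X) \leq \theta] \leq \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], f^{-1}([-\infty, \theta]))^2}{2} \right); \]

(v) for every quasiconvex $f \colon \mathcal{X} \to \mathbb{R} \cup \{\pm\infty\}$
and every $\theta \in \mathbb{R} \cup \{\pm\infty\}$,

\[ \mathbb{P}[f(X) \leq \theta] \leq \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], f^{-1}([-\infty, \theta]))^2}{2} \right). \]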

Note that if $f$ is quasilinear (i.e. both $f$ and $-f$ are quasiconvex), then
formulation (v) yields concentration inequalities for both the lower and upper
tails of $f(X)$.

The notation and setting of the paper are covered in section 2, along with a
review of some definitions and results from the concentration-of-measure
literature. Normal distance is defined and its properties (including theorem
1.1) are examined in section 3. In section 4, the normalizing function $\Psi$
is determined explicitly in several cases, thereby connecting theorem 1.1 with
classical concentration results. In particular, proposition 4.4 identifies the
normal distance that corresponds to the concentration of a vector, the entries
of which are the empirical (sampled) means of functions of independent random
variables. In section 5, it is shown that the inequality in theorem 1.1 (iii)
is asymptotically sharp (in the sense used in large deviations theory) in the
high-dimensional limit, provided that $A$ is convex and "sufficiently round" at
those points of $A$ that are closest to the center of mass $\mathbb{E}[X]$.
Finally, for completeness, the method of Chernoff bounds and its consequences
for convex sets are reviewed in an appendix (section 6).

Figure 2.1. A convex set $K$ and its outward normal cones at points
$p, q, r \in K$. $\partial K$ is smooth at $p \in \partial K$, so $N^*_p K$ is
a half-line; $\partial K$ has a vertex at $q$, so $N^*_q K$ is a pointed convex
cone with non-empty interior; at the interior point $r$, $N^*_r K$ is trivial,
i.e. it contains no non-zero normal.

2. Notation and Background 

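Let $\mathcal{X}$ be a real topological vector space. Let $\mathcal{X}^*$
denote the continuous dual space of $\mathcal{X}$ and let
$\langle \ell, x \rangle$ denote the dual pairing between
$\ell \in \mathcal{X}^*$ and $x \in \mathcal{X}$; $\langle \nu, \ell \rangle$
will also denote the dual pairing between $\nu \in \mathcal{X}^{**}$ and
$\ell \in \mathcal{X}^*$. It is not strictly necessary to assume that
$\mathcal{X}$ is locally convex, but the results of this paper may be trivially
true if $\mathcal{X}^*$ does not contain enough linear functionals.

2.1. Half-Spaces. Given $p \in \mathcal{X}$ and $\nu \in \mathcal{X}^*$,
$\mathbb{H}_{p,\nu}$ will denote the closed half-space of $\mathcal{X}$ that
has $p$ in its frontier and outward-pointing normal $\nu$, i.e.

\[ \mathbb{H}_{p,\nu} := \{ x \in \mathcal{X} \mid \langle \nu, x \rangle \leq \langle \nu, p \rangle \}. \tag{2.1} \]

Note well the degenerate case $\mathbb{H}_{p,0} = \mathcal{X}$. Every
$(p, \nu) \in \mathcal{X} \times \mathcal{X}^*$ defines a unique closed
half-space of $\mathcal{X}$, whereas a given closed half-space can have
multiple distinct representations: $\mathbb{H}_{p,\nu} = \mathbb{H}_{p',\nu'}$
if, and only if, $\nu$ is a positive multiple of $\nu'$ and
$\langle \nu, p - p' \rangle = \langle \nu', p - p' \rangle = 0$.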

2.2. Convex Sets and Cones. The closed convex hull of $A \subseteq \mathcal{X}$
will be denoted by $\overline{\mathrm{co}}(A)$. Given a closed convex set
$K \subseteq \mathcal{X}$ and $p \in K$, $N^*_p K$ denotes the outward normal
cone to $K$ at $p$, and $N^* K$ denotes the outward normal bundle of $K$:

\[ N^*_p K := \{ \nu \in \mathcal{X}^* \mid K \subseteq \mathbb{H}_{p,\nu} \}, \tag{2.2} \]

\[ N^* K := \{ (p, \nu) \in \mathcal{X} \times \mathcal{X}^* \mid p \in K, \nu \in N^*_p K \}. \tag{2.3} \]

The outward normal cone $N^*_p K$ is a pointed convex cone: it contains 0, is
convex, and $s_1 \nu_1 + s_2 \nu_2 \in N^*_p K$ for all $s_1, s_2 \geq 0$ and
all $\nu_1, \nu_2 \in N^*_p K$. Also, $N^*_p K = \{0\}$ if $p$ is an interior
point of $K$. Note that $N^* K \subseteq \mathcal{X} \times \mathcal{X}^*$ is
not necessarily a convex set. See figure 2.1 for an illustration.

2.3. Quasiconvexity. If $K \subseteq \mathcal{X}$ is a convex set, then a
function $f \colon K \to \mathbb{R} \cup \{\pm\infty\}$ is said to be
quasiconvex if, for every $\theta \in \mathbb{R} \cup \{\pm\infty\}$, the
sublevel set

\[ f^{-1}([-\infty, \theta]) := \{ x \in K \mid -\infty \leq f(x) \leq \theta \} \tag{2.4} \]

is a convex set; equivalently, $f$ is quasiconvex if, for all $x, y \in K$ and
$t \in [0, 1]$,

\[ f((1 - t)x + ty) \leq \max\{ f(x), f(y) \}. \tag{2.5} \]

$f$ is said to be quasiconcave if $-f$ is quasiconvex, and $f$ is said to be
quasilinear if it is both quasiconvex and quasiconcave. Every convex (resp.
concave, linear) function is quasiconvex (resp. quasiconcave, quasilinear), but
not vice versa. In particular, a function
$f \colon \mathbb{R}^N \to \mathbb{R}$ is quasilinear if, and only if, it is
the composition of a monotone function with a linear functional on
$\mathbb{R}^N$ [5, p. 122].



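2.4. Indicator and Characteristic Functions. Given a set
$A \subseteq \mathcal{X}$, $\mathbb{1}_A$ and $\chi_A$ denote its indicator
function and characteristic function respectively:

\[ \mathbb{1}_A(x) := \begin{cases} 1, & \text{if } x \in A, \\ 0, & \text{if } x \notin A; \end{cases} \tag{2.6} \]

\[ \chi_A(x) := \begin{cases} 0, & \text{if } x \in A, \\ +\infty, & \text{if } x \notin A. \end{cases} \tag{2.7} \]

Note that, for any convex set $K \subseteq \mathcal{X}$, $\chi_K$ is a convex
function.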



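2.5. Probabilistic Notions. Let $(\Omega, \mathscr{F}, \mathbb{P})$ be a
probability space and let $X \colon \Omega \to \mathcal{X}$ be an
$\mathcal{X}$-valued random variable. $\mathbb{E}[\cdot]$ denotes the
expectation operator with respect to the probability measure $\mathbb{P}$:
$\mathbb{E}[X]$ is defined to be any $m \in \mathcal{X}$ such that

\[ \mathbb{E}[\langle \ell, X - m \rangle] = \int_\Omega \langle \ell, X(\omega) - m \rangle \, \mathrm{d}\mathbb{P}(\omega) = 0 \quad \text{for all } \ell \in \mathcal{X}^*; \tag{2.8} \]

if $\mathcal{X}^*$ separates the points of $\mathcal{X}$ (e.g. if $\mathcal{X}$
is a Banach space), then $\mathbb{E}[X]$ is unique. For
$Y \colon \Omega \to \mathbb{R}$, any $m \in \mathbb{R}$ that satisfies

\[ \sup\left\{ v \in \mathbb{R} \,\middle|\, \mathbb{P}[Y \leq v] < \tfrac{1}{2} \right\} \leq m \leq \inf\left\{ v \in \mathbb{R} \,\middle|\, \mathbb{P}[Y \leq v] > \tfrac{1}{2} \right\} \tag{2.9} \]

will be called a median of $Y$ and denoted $\mathbb{M}[Y]$.
$M_X \colon \mathcal{X}^* \to [0, +\infty]$ denotes the moment-generating
function defined by

\[ M_X(\ell) := \mathbb{E}[\exp\langle \ell, X \rangle] \quad \text{for all } \ell \in \mathcal{X}^*. \tag{2.10} \]

$\Lambda_X \colon \mathcal{X}^* \to \mathbb{R} \cup \{\pm\infty\}$ denotes the
cumulant generating function (or logarithmic moment-generating function)
defined by

\[ \Lambda_X(\ell) := \log M_X(\ell) = \log \mathbb{E}[\exp\langle \ell, X \rangle] \quad \text{for all } \ell \in \mathcal{X}^*. \tag{2.11} \]

By Hölder's inequality, $\Lambda_X$ is a convex function.

2.6. Talagrand's Inequalities. It has been known for some time that convex sets
and functions enjoy good concentration properties; moreover, to get good
concentration results, it is necessary to measure distances in the right way.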

For example, a theorem of Talagrand shows that if a convex set
$K \subseteq \mathbb{R}^N$ occupies a "significant" portion of the Hamming cube
$\{-1, +1\}^N$ and $t \gg 1$, then "nearly all" of the points of the Hamming
cube lie within Euclidean distance $t$ of $K$. Define the Euclidean Hausdorff
distance from $x \in \mathbb{R}^N$ to $A \subseteq \mathbb{R}^N$ by

\[ d_{\mathrm{Haus}}(x, A) := \inf\{ \|x - a\|_2 \mid a \in A \}. \tag{2.12} \]

Talagrand [26] showed that if $X$ is uniformly distributed in $\{-1, +1\}^N$
then, for any $A \subseteq \mathbb{R}^N$,
$\mathbb{E}[\exp(d_{\mathrm{Haus}}(X, \mathrm{co}(A))^2 / 8)] \leq \mathbb{P}[X \in A]^{-1}$;
hence, Chebyshev's inequality implies that, for any $t > 0$,

\[ \mathbb{P}[X \in A] \, \mathbb{P}[d_{\mathrm{Haus}}(X, \mathrm{co}(A)) \geq t] \leq \exp\left( - \frac{t^2}{8} \right). \tag{2.13} \]

More interesting results can be obtained if one uses not the Euclidean distance
but the Hamming distance or, more accurately, an infimum over weighted Hamming
distances. For $w = (w_1, \dots, w_N) \in [0, +\infty)^N$, define the
$w$-weighted Hamming distance $d_w$ on a product of sets
$\mathcal{X} = \prod_{n=1}^N \mathcal{X}_n$ by

\[ d_w(x, y) := \sum_{n=1}^N w_n (1 - \delta_{x_n, y_n}); \tag{2.14} \]

that is, $d_w(x, y)$ is the $w$-weighted sum of the number of components in
which $x, y \in \mathcal{X}$ differ. For $x \in \mathcal{X}$ and
$A \subseteq \mathcal{X}$, set $d_w(x, A) := \inf_{a \in A} d_w(x, a)$. Define
Talagrand's convex distance from $x \in \mathcal{X}$ to
$A \subseteq \mathcal{X}$ by

\[ d_{\mathrm{Tal}}(x, A) := \sup\left\{ d_w(x, A) \,\middle|\, w \in [0, +\infty)^N, \sum_{n=1}^N w_n^2 = 1 \right\} \tag{2.15} \]

and, for $A, B \subseteq \mathcal{X}$, let
$d_{\mathrm{Tal}}(A, B) := \inf_{a \in A} d_{\mathrm{Tal}}(a, B)$. Talagrand
[27, §4.1] showed that if $X = (X_1, \dots, X_N)$ is any $\mathcal{X}$-valued
random variable with independent components, then

\[ \mathbb{P}[X \in A] \, \mathbb{P}[X \in B] \leq \exp\left( - \frac{d_{\mathrm{Tal}}(A, B)^2}{4} \right). \tag{2.16} \]

These bounds on the probabilities of sets lead to deviation inequalities for
convex Lipschitz functions. For example (cf. [13], [26]), let $X$ be any random
variable in the unit cube in $\mathbb{R}^N$ with independent components, and
let $f \colon [0, 1]^N \to \mathbb{R}$ be convex and Lipschitz with
$\|f\|_{\mathrm{Lip}} \leq 1$; then, for any $t > 0$,

\[ \mathbb{P}[f(X) \geq \mathbb{M}[f(X)] + t] \leq 2 \exp\left( - \frac{t^2}{4} \right). \tag{2.17} \]

Note, however, that these results use not only the convexity of the function of
interest, but also require Lipschitz continuity. What concentration inequalities
can be shown to hold without smoothness assumptions?

2.7. McDiarmid's Inequality. One smoothness-free concentration inequality is
McDiarmid's inequality [20], also known as the bounded differences inequality,
which itself generalizes an earlier inequality of Hoeffding [11]. McDiarmid's
inequality is by no means the strongest concentration-of-measure inequality in
the literature, but is useful because of its simple hypotheses and proof.
Define the McDiarmid diameter of $f$, denoted $\mathcal{D}[f]$, by

\[ \mathcal{D}[f] := \left( \sum_{n=1}^N \mathcal{D}_n[f]^2 \right)^{1/2}, \tag{2.18} \]

where the $n$-th McDiarmid subdiameter $\mathcal{D}_n[f]$ is defined by

\[ \mathcal{D}_n[f] := \sup\{ |f(x) - f(y)| \mid x_j = y_j \text{ for } j \neq n \}. \tag{2.19} \]

When $\mathbb{E}[|f(X)|]$ is finite and $X_1, \dots, X_N$ are independent,
McDiarmid's inequality bounds the deviations of $f(X)$ from $\mathbb{E}[f(X)]$
in terms of the McDiarmid diameter of $f$: for any $r > 0$,

\[ \mathbb{P}[f(X) - \mathbb{E}[f(X)] \leq -r] \leq \exp\left( - \frac{2 r^2}{\mathcal{D}[f]^2} \right), \tag{2.20a} \]

\[ \mathbb{P}[f(X) - \mathbb{E}[f(X)] \geq r] \leq \exp\left( - \frac{2 r^2}{\mathcal{D}[f]^2} \right). \tag{2.20b} \]

McDiarmid's inequality implies that, for any
$\theta \in \mathbb{R} \cup \{\pm\infty\}$,

\[ \mathbb{P}[f(X) \leq \theta] \leq \exp\left( - \frac{2 (\mathbb{E}[f(X)] - \theta)_+^2}{\mathcal{D}[f]^2} \right), \tag{2.21a} \]

\[ \mathbb{P}[f(X) \geq \theta] \leq \exp\left( - \frac{2 (\theta - \mathbb{E}[f(X)])_+^2}{\mathcal{D}[f]^2} \right). \tag{2.21b} \]

McDiarmid's inequality (and similar inequalities such as martingale
inequalities) have the advantage that a bound on the tails of $f(X)$ is
obtained solely in terms of the mean output $\mathbb{E}[f(X)]$ and the
McDiarmid diameter $\mathcal{D}[f]$. However, McDiarmid's inequality cannot
take advantage of any other properties of $f$ such as convexity or
monotonicity; furthermore, if $f$ has infinite McDiarmid diameter on the
essential range of $X$, then the trivial upper bound 1 is obtained.
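As an illustration of how the lower-tail form of McDiarmid's inequality is evaluated in practice, the following minimal Python sketch computes the bound $\exp(-2(\mathbb{E}[f(X)] - \theta)_+^2 / \mathcal{D}[f]^2)$ from a list of subdiameters. The function names and the worked example (the empirical mean of 100 variables with values in $[0, 1]$) are illustrative assumptions, not part of the paper.

```python
import math


def mcdiarmid_diameter(subdiameters):
    """McDiarmid diameter: square root of the sum of squared subdiameters."""
    return math.sqrt(sum(d * d for d in subdiameters))


def mcdiarmid_lower_tail(mean, theta, subdiameters):
    """Upper bound on P[f(X) <= theta] given E[f(X)] = mean.

    Returns exp(-2 (mean - theta)_+^2 / D[f]^2), with the trivial bound 1
    when theta is not below the mean or when the diameter is infinite.
    """
    gap = max(mean - theta, 0.0)
    diameter = mcdiarmid_diameter(subdiameters)
    if gap == 0.0 or math.isinf(diameter):
        return 1.0
    if diameter == 0.0:
        # f is constant and equal to its mean, so f(X) <= theta < mean is impossible.
        return 0.0
    return math.exp(-2.0 * gap * gap / (diameter * diameter))


# Hypothetical example: f is the average of 100 inputs in [0, 1], so each
# subdiameter is 1/100 and D[f]^2 = 100 * (1/100)^2 = 1/100.
bound = mcdiarmid_lower_tail(mean=0.5, theta=0.4, subdiameters=[0.01] * 100)
```

In this example the gap is 0.1, so the bound is $\exp(-2 \cdot 0.01 / 0.01) = e^{-2}$.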

There are many other sources of concentration-of-measure inequalities: these
include logarithmic Sobolev inequalities and the Herbst argument [2] [10] [12],
the entropy method [3] [4] [14], and information-theoretic methods [19]. Of
particular interest are those concentration results that apply to
infinite-dimensional settings [16].

3. Normal Distance 

As noted above, efficient presentation of many concentration-of-measure
inequalities relies on having an appropriate notion of function variation (e.g.
the Lipschitz norm or McDiarmid diameter) or distance (e.g. Talagrand's convex
distance). The inequalities that will be established in section 4 can be
phrased in terms of transforms of moment-generating functions, but are more
transparent if phrased in terms of a normal distance, which will be introduced
in this section.

Fix a function $\Psi \colon \mathcal{X}^* \to [0, +\infty]$ that is positively
homogeneous of degree one, i.e. such that $\Psi(\alpha \ell) = \alpha \Psi(\ell)$
for all $\alpha > 0$ and all $\ell \in \mathcal{X}^*$. By analogy with the
situation in finite-dimensional Euclidean space, in which
$\Psi = \| \cdot \|_2$ on $(\mathbb{R}^N)^*$, define the distance from a point
$x \in \mathcal{X}$ to a half-space $\mathbb{H}_{p,\nu} \subseteq \mathcal{X}$
by

\[ d_{\perp,\Psi}(x, \mathbb{H}_{p,\nu}) := \frac{\langle \nu, x - p \rangle_+}{\Psi(\nu)}, \tag{3.1} \]

with the convention that 0/0 = 0, since the distance from $x \in \mathcal{X}$
to the trivial half-space $\mathbb{H}_{p,0} = \mathcal{X}$ ought to be zero.
Note that $d_{\perp,\Psi}(x, \mathbb{H}_{p,\nu}) = 0$ whenever
$x \in \mathbb{H}_{p,\nu}$; note also that the homogeneity assumption on $\Psi$
ensures that (3.1) is an unambiguous definition. We now generalize (3.1) to
more general subsets of $\mathcal{X}$ than half-spaces. The heuristic is that
the distance from $x$ to $A \subseteq \mathcal{X}$ should be the greatest
possible distance (in the sense of (3.1)) from $x$ to any half-space that
contains $A$; the existence of the degenerate half-space $\mathbb{H}_{p,0}$
ensures that the normal distance is zero if there are no proper half-spaces
that contain $A$.

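Definition 3.1. Let $x \in \mathcal{X}$ and $A \subseteq \mathcal{X}$. The
$\Psi$-normal distance from $x$ to $A$, denoted $d_{\perp,\Psi}(x, A)$, is
defined (with the same convention that 0/0 = 0) by

\[ d_{\perp,\Psi}(x, A) := \sup\left\{ \frac{\langle \nu, x - p \rangle_+}{\Psi(\nu)} \,\middle|\, \begin{array}{l} p \in \mathcal{X} \text{ and } \nu \in \mathcal{X}^* \\ \text{such that } A \subseteq \mathbb{H}_{p,\nu} \end{array} \right\}. \tag{3.2} \]

The $\Psi$-normal distance from $A \subseteq \mathcal{X}$ to
$B \subseteq \mathcal{X}$ is defined by
$d_{\perp,\Psi}(A, B) := \inf_{a \in A} d_{\perp,\Psi}(a, B)$. In the special
case $\mathcal{X} = \mathbb{R}^N$ and $\Psi = \| \cdot \|_2$ on
$(\mathbb{R}^N)^*$, we shall simply write $d_\perp$ for $d_{\perp,\Psi}$, i.e.

\[ d_\perp(x, A) := \sup\left\{ \frac{(\nu \cdot (x - p))_+}{\|\nu\|_2} \,\middle|\, \begin{array}{l} p \in \mathbb{R}^N \text{ and } \nu \in (\mathbb{R}^N)^* \\ \text{such that } A \subseteq \mathbb{H}_{p,\nu} \end{array} \right\}. \tag{3.3} \]

Note well that the definition of the normal distance $d_{\perp,\Psi}(x, A)$
does not require $\mathcal{X}$ to be normed; even when $\mathcal{X}$ is
equipped with a norm $\| \cdot \|_{\mathcal{X}}$ and $\Psi$ is the
corresponding operator norm, the normal distance $d_{\perp,\Psi}(x, A)$ is not
the same as the Hausdorff distance from $x$ to $A$ defined by

\[ d_{\mathrm{Haus}}(x, A) := \inf\{ \|x - a\|_{\mathcal{X}} \mid a \in A \}; \tag{3.4} \]

see figure 3.1 for an illustration. Note also that it is not generally true
that $d_{\perp,\Psi}(A, B) = d_{\perp,\Psi}(B, A)$: consider e.g.
$B := \{(0, 1)\}$ and $A$ as in figure 3.1, in which case

\[ d_{\perp,\Psi}(A, B) = \inf_{a \in A} d_{\perp,\Psi}(a, B) = 1 \neq 0 = d_{\perp,\Psi}(B, A). \]

For any $x \in \mathcal{X}$ and $A \subseteq B \subseteq \mathcal{X}$, it holds
that $d_{\perp,\Psi}(x, B) \leq d_{\perp,\Psi}(x, A)$. Furthermore, since a
closed half-space $\mathbb{H}_{p,\nu}$ contains $A$ if, and only if, it
contains the closed convex hull $\overline{\mathrm{co}}(A)$ of $A$, the
following equality holds:

\[ d_{\perp,\Psi}(x, A) = d_{\perp,\Psi}(x, \overline{\mathrm{co}}(A)) \quad \text{for all } x \in \mathcal{X} \text{ and all } A \subseteq \mathcal{X}. \tag{3.5} \]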
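To make the difference between the normal distance and the Hausdorff distance concrete, the following Python sketch approximates the Euclidean normal distance from a point to a finite set in the plane. It is only a sketch under stated assumptions: the supremum over normals is replaced by a maximum over a finite grid of unit directions, for each direction the tightest containing half-space is taken through a support point of the set, and the function names and the two-point example set are hypothetical.

```python
import math


def halfspace_distance(x, p, v):
    """Euclidean distance from x to the half-space {y : <v, y> <= <v, p>},
    i.e. <v, x - p>_+ / ||v||_2, with the convention 0/0 = 0."""
    norm = math.hypot(v[0], v[1])
    if norm == 0.0:
        return 0.0
    inner = v[0] * (x[0] - p[0]) + v[1] * (x[1] - p[1])
    return max(inner, 0.0) / norm


def normal_distance_2d(x, points, n_directions=3600):
    """Grid approximation of the normal distance from x to a finite set."""
    best = 0.0
    for k in range(n_directions):
        angle = 2.0 * math.pi * k / n_directions
        v = (math.cos(angle), math.sin(angle))
        # Support point: the tightest half-space with outward normal v that
        # contains every point has this point on its frontier.
        p = max(points, key=lambda a: v[0] * a[0] + v[1] * a[1])
        best = max(best, halfspace_distance(x, p, v))
    return best


def hausdorff_distance(x, points):
    """Euclidean distance from x to the nearest point of the set."""
    return min(math.hypot(x[0] - a[0], x[1] - a[1]) for a in points)


# Hypothetical non-convex set: two points whose convex hull is a segment.
A = [(1.0, 2.0), (1.0, -2.0)]
d_normal = normal_distance_2d((0.0, 0.0), A)
d_haus = hausdorff_distance((0.0, 0.0), A)
```

For this set the origin is at normal distance about 1 (the distance to the segment joining the two points) but at Hausdorff distance $\sqrt{5}$, in line with the observation that the normal distance only sees the closed convex hull.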

Remark 3.2. It is natural to ask what, if any, relation there is between the
normal distance and Talagrand's convex distance. The simplest answer is to say
that the two distances should be compared only with great caution, since each
belongs to a different setting: Talagrand's distance is defined on a product of
sets, whereas the normal distance is defined on a topological vector space.
Even on $\mathbb{R}^N$, the two distances measure different quantities: in some
sense, $d_{\mathrm{Tal}}(x, A)$ measures how many of the coordinates of $x$ are
covered by $A$, but does not measure the geometric distance between them; on
the other hand, $d_{\perp,\Psi}(x, A)$ is a much more geometric measure of how
far $x$ is from $A$ in terms of linear functionals on $\mathcal{X}$, and the
"size" of those linear functionals is measured by $\Psi$. In particular,
Talagrand's convex distance is positively homogeneous of degree zero, whereas
the normal distance is positively homogeneous of degree one: for any
$x \in \mathbb{R}^N$, $A \subseteq \mathbb{R}^N$, and $\alpha > 0$,

\[ d_{\mathrm{Tal}}(\alpha x, \alpha A) = d_{\mathrm{Tal}}(x, A), \]

\[ d_{\perp,\Psi}(\alpha x, \alpha A) = \alpha \, d_{\perp,\Psi}(x, A). \]

Figure 3.1. An example of a subset $A$ of the Euclidean plane $\mathbb{R}^2$
for which the normal distance $d_\perp(0, A) = 1$ unit (cf. the dashed line),
as opposed to the Euclidean Hausdorff distance $d_{\mathrm{Haus}}(0, A) = 2$
units (cf. the dotted arc).

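This section concludes with the proof of the portmanteau theorem (theorem 1.1)
and some final remarks on its applicability:

Proof of theorem 1.1. The equivalence will be established by showing that
(i) $\Rightarrow$ (ii) $\Rightarrow$ (iii) $\Rightarrow$ (iv) $\Rightarrow$ (v)
$\Rightarrow$ (i).

Suppose that (i) holds. Then, with the infima and supremum taken over all
closed half-spaces $\mathbb{H}_{p,\nu}$ that contain $K$,

\begin{align*}
\mathbb{P}[X \in K]
& \leq \inf \mathbb{P}[X \in \mathbb{H}_{p,\nu}] && \text{by monotonicity of } \mathbb{P}, \\
& \leq \inf \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], \mathbb{H}_{p,\nu})^2}{2} \right) && \text{by (i),} \\
& = \exp\left( - \tfrac{1}{2} \sup d_{\perp,\Psi}(\mathbb{E}[X], \mathbb{H}_{p,\nu})^2 \right) \\
& = \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], K)^2}{2} \right) && \text{by (3.2).}
\end{align*}

Hence, (i) implies (ii).

Suppose that (ii) holds; then

\begin{align*}
\mathbb{P}[X \in A]
& \leq \mathbb{P}[X \in \overline{\mathrm{co}}(A)] && \text{since } A \subseteq \overline{\mathrm{co}}(A), \\
& \leq \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], \overline{\mathrm{co}}(A))^2}{2} \right) && \text{by (ii),} \\
& = \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], A)^2}{2} \right) && \text{by (3.5),}
\end{align*}

and so (ii) implies (iii). (iv) follows from (iii) upon setting
$A := \{ x \in \mathcal{X} \mid f(x) \leq \theta \}$. (v) is clearly a special
case of (iv). (i) follows from (v) upon setting
$f := \chi_{\mathbb{H}_{p,\nu}}$ and $\theta := 1$. □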

Remark 3.3. It is important to note that all the bounds in theorem 1.1 may be
trivial if the dual space $\mathcal{X}^*$ is not rich enough. For example,
given a measure space $(Z, \mathscr{F}, \mu)$, for $0 < p < 1$, the space

\[ L^p(Z, \mathscr{F}, \mu; \mathbb{K}) := \left\{ f \colon Z \to \mathbb{K} \,\middle|\, \|f\|_p := \left( \int_Z |f(z)|^p \, \mathrm{d}\mu(z) \right)^{1/p} < +\infty \right\} \]

is a topological vector space with respect to the quasinorm topology generated
by $\| \cdot \|_p$. This space is not locally convex and has a trivial dual
space: the only continuous linear functional on this space is the zero
functional, and so the only closed half-space is the whole space. See e.g.
[24, §1.47] for further discussion of spaces such as
$L^p([0, 1]; \mathbb{K})$ for $0 < p < 1$.

It is tempting to eliminate these pathologies by working with the algebraic,
instead of the topological, dual of $\mathcal{X}$. This can be done, and most
results go through mutatis mutandis; in particular, it is necessary to replace
all references to the closed convex hull $\overline{\mathrm{co}}(A)$ of
$A \subseteq \mathcal{X}$ with the convex hull $\mathrm{co}(A)$; the analogue
of (3.5) (with $\Psi$ now defined on the algebraic dual of $\mathcal{X}$) is

\[ d_{\perp,\Psi}(x, A) = d_{\perp,\Psi}(x, \mathrm{co}(A)) \quad \text{for all } x \in \mathcal{X} \text{ and all } A \subseteq \mathcal{X}. \]

The principal disadvantage of ignoring all topological structure on
$\mathcal{X}$, of course, is that there are no longer notions of interior,
closure and frontier, although it still makes sense to discuss the extremal
points of convex sets.

4. Normal Distance as a Concentration Rate 

The method of Chernoff bounding (reviewed in lemma 6.1) gives bounds on
$\mathbb{P}[X \in \mathbb{H}_{p,\nu}]$ in terms of the moment-generating
function $M_X$. If these bounds can be formulated in terms of a suitable normal
distance, then theorem 1.1 produces equivalent bounds on
$\mathbb{P}[X \in K]$ for convex $K$, on $\mathbb{P}[X \in A]$, etc. As noted
in [18, §2], the best Chernoff bound on $\mathbb{P}[f(X) \geq \theta]$ is never
better than the best bound using all the moments of $f(X)$: if $f$ takes only
non-negative values, then

\[ \inf_{k \in \mathbb{N}} \theta^{-k} \, \mathbb{E}[f(X)^k] \leq \inf_{s > 0} e^{-s\theta} \, \mathbb{E}[e^{s f(X)}]. \tag{4.1} \]

However, Chernoff bounds have the advantage that they are geometrically very
easy to handle.
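The comparison between moment bounds and Chernoff bounds can be checked by hand in a simple case. The Python sketch below does so for a standard exponential random variable $Y$, chosen here purely as an illustrative assumption because both sides are available in closed form: $\mathbb{E}[Y^k] = k!$, and $\mathbb{E}[e^{sY}] = 1/(1 - s)$ for $s < 1$, so that the optimal Chernoff bound on $\mathbb{P}[Y \geq \theta]$ is $\theta e^{1 - \theta}$ when $\theta > 1$ (attained at $s = 1 - 1/\theta$). The function names are hypothetical.

```python
import math


def moment_bound(theta, max_order=100):
    """min over k = 0..max_order of k! / theta^k, for Y ~ Exp(1)."""
    best = term = 1.0  # the k = 0 term
    for k in range(1, max_order + 1):
        term *= k / theta  # term now equals k! / theta^k
        best = min(best, term)
    return best


def chernoff_bound(theta):
    """inf over 0 <= s < 1 of exp(-s * theta) / (1 - s), for Y ~ Exp(1)."""
    if theta <= 1.0:
        return 1.0  # the infimum is attained at s = 0
    return theta * math.exp(1.0 - theta)


theta = 5.0
exact_tail = math.exp(-theta)      # P[Y >= theta] for Y ~ Exp(1)
by_moments = moment_bound(theta)   # 4! / 5^4 = 0.0384
by_chernoff = chernoff_bound(theta)  # 5 * exp(-4)
```

Here the exact tail, the moment bound and the Chernoff bound are ordered as expected: the moment bound is the tighter of the two upper bounds, and both dominate the true probability.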

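The next result provides the normal distance formulation for an
$\mathcal{X}$-valued Gaussian random variable (in fact, for a family of such
variables). In the special case of a single Gaussian random vector $X$ on
$\mathcal{X} = \mathbb{R}^N$ with covariance operator $C_X = \sigma I_N$,
proposition 4.1 yields the classical Chernoff bound for a multivariate normal
random variable.

Proposition 4.1. Let $\Gamma$ be a family of Gaussian random vectors in
$\mathcal{X}$. For each $X \in \Gamma$, let
$C_X \colon \mathcal{X}^* \to \mathcal{X}^{**}$ be its covariance operator
defined by

\[ \langle C_X \ell, \nu \rangle := \mathbb{E}[\langle \ell, X - \mathbb{E}[X] \rangle \langle \nu, X - \mathbb{E}[X] \rangle]. \tag{4.2} \]

Let $E := \{ \mathbb{E}[X] \mid X \in \Gamma \}$, let

\[ \Psi_\Gamma(\nu) := \sup_{X \in \Gamma} \sqrt{\langle C_X \nu, \nu \rangle}, \tag{4.3} \]

and let $d_{\perp,\Psi_\Gamma}$ be the corresponding normal distance. Then, for
any $A \subseteq \mathcal{X}$,

\[ \sup_{X \in \Gamma} \mathbb{P}[X \in A] \leq \exp\left( - \frac{d_{\perp,\Psi_\Gamma}(E, A)^2}{2} \right). \tag{4.4} \]

Proof. For each $X \in \Gamma$, the moment-generating function for $X$ is given
by

\[ M_X(\ell) := \mathbb{E}[\exp\langle \ell, X \rangle] = \exp\left( \langle \ell, \mathbb{E}[X] \rangle + \frac{\langle C_X \ell, \ell \rangle}{2} \right). \tag{4.5} \]

Therefore,

\begin{align*}
\mathbb{P}[X \in \mathbb{H}_{p,\nu}]
& \leq \inf_{s \geq 0} \exp\left( s \langle \nu, p - \mathbb{E}[X] \rangle + \frac{s^2}{2} \langle C_X \nu, \nu \rangle \right) && \text{by (4.5) and lemma 6.1,} \\
& = \exp\left( - \frac{\langle \nu, \mathbb{E}[X] - p \rangle_+^2}{2 \langle C_X \nu, \nu \rangle} \right) \\
& \leq \exp\left( - \frac{\langle \nu, \mathbb{E}[X] - p \rangle_+^2}{2 \Psi_\Gamma(\nu)^2} \right) && \text{by (4.3),} \\
& = \exp\left( - \frac{d_{\perp,\Psi_\Gamma}(\mathbb{E}[X], \mathbb{H}_{p,\nu})^2}{2} \right) && \text{by (3.2).}
\end{align*}

Hence, by theorem 1.1,

\[ \mathbb{P}[X \in A] \leq \exp\left( - \frac{d_{\perp,\Psi_\Gamma}(\mathbb{E}[X], A)^2}{2} \right), \]

and so

\begin{align*}
\sup_{X \in \Gamma} \mathbb{P}[X \in A]
& \leq \sup_{X \in \Gamma} \exp\left( - \frac{d_{\perp,\Psi_\Gamma}(\mathbb{E}[X], A)^2}{2} \right) \\
& = \exp\left( - \tfrac{1}{2} \inf_{X \in \Gamma} d_{\perp,\Psi_\Gamma}(\mathbb{E}[X], A)^2 \right) \\
& = \exp\left( - \frac{d_{\perp,\Psi_\Gamma}(E, A)^2}{2} \right). \qquad \square
\end{align*}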

Lemma 6.1 also has the following consequences for random vectors supported in
a cuboid in $\mathbb{R}^N$; this encompasses two standard situations in which
concentration is often studied, namely concentration for functions on the
Euclidean unit cube and on the Hamming cube.

Proposition 4.2. Let $X$ be a random vector in $\mathbb{R}^N$ with independent
components such that each component $X_n$ almost surely takes values in a fixed
interval of length $L_n$. Let

\[ \Psi(\nu) := \frac{1}{2} \left( \sum_{n=1}^N L_n^2 \nu_n^2 \right)^{1/2} \tag{4.6} \]

and let $d_{\perp,\Psi}$ be the corresponding normal distance. Then, for any
$A \subseteq \mathbb{R}^N$,

\[ \mathbb{P}[X \in A] \leq \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], A)^2}{2} \right). \tag{4.7} \]

A fortiori, if $X$ takes values in (a translate of) the unit cube $[0, 1]^N$,
then

\[ \mathbb{P}[X \in A] \leq \exp\left( - 2 \, d_\perp(\mathbb{E}[X], A)^2 \right), \tag{4.8} \]

and if $X$ takes values in (a translate of) the Hamming cube $\{-1, +1\}^N$,
then

\[ \mathbb{P}[X \in A] \leq \exp\left( - \frac{d_\perp(\mathbb{E}[X], A)^2}{2} \right). \tag{4.9} \]

Proof. The proof is similar to the Gaussian case: it is an application of lemma
6.1 and Hoeffding's lemma [11, lemma 1 and (4.16)], which bounds the
moment-generating function of $X_n$ as follows:

\[ M_{X_n}(\ell_n) := \mathbb{E}[\exp(\ell_n X_n)] \leq \exp\left( \ell_n \mathbb{E}[X_n] + \frac{\ell_n^2 L_n^2}{8} \right). \]

Note that the claim can also be proved directly by applying McDiarmid's
inequality to the function $\langle \nu, \cdot \rangle$, which has mean
$\mathbb{E}[\langle \nu, X \rangle] = \langle \nu, \mathbb{E}[X] \rangle$ and
McDiarmid diameter $\sqrt{L_1^2 \nu_1^2 + \dots + L_N^2 \nu_N^2}$. □
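For a single half-space, the bound of proposition 4.2 can be evaluated directly. The Python sketch below assumes the normalizing function $\Psi(\nu) = \tfrac{1}{2} (\sum_n L_n^2 \nu_n^2)^{1/2}$ and the bound $\exp(-d^2/2)$ with $d = \langle \nu, \mathbb{E}[X] - p \rangle_+ / \Psi(\nu)$; the function names and the numerical example are illustrative assumptions.

```python
import math


def psi_cuboid(v, lengths):
    """Psi(v) = (1/2) * sqrt(sum_n L_n^2 v_n^2) for side lengths L_n."""
    return 0.5 * math.sqrt(sum((length * vn) ** 2 for length, vn in zip(lengths, v)))


def halfspace_probability_bound(mean, p, v, lengths):
    """Bound on P[<v, X> <= <v, p>] for X with independent components,
    component n supported on an interval of length lengths[n]."""
    inner = max(sum(vn * (m - pn) for vn, m, pn in zip(v, mean, p)), 0.0)
    if inner == 0.0:
        return 1.0  # the mean lies in the half-space: distance 0 (0/0 = 0)
    psi = psi_cuboid(v, lengths)
    if psi == 0.0:
        return 0.0  # positive numerator, zero denominator: infinite distance
    distance = inner / psi
    return math.exp(-0.5 * distance * distance)


# Hypothetical example in the unit cube [0, 1]^4 with mean (1/2, ..., 1/2):
# the half-space {x : x_1 + ... + x_4 <= 1} is at Euclidean distance 1/2 from
# the mean, so the unit-cube form exp(-2 d^2) gives exp(-1/2).
bound = halfspace_probability_bound(
    mean=[0.5] * 4, p=[0.25] * 4, v=[1.0] * 4, lengths=[1.0] * 4
)
```

The two forms agree here because, for unit side lengths, $\Psi(\nu) = \tfrac{1}{2} \|\nu\|_2$, so the $\Psi$-normal distance is twice the Euclidean one.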

Remark 4.3. Note the similarity between the normal distances of propositions
4.1 and 4.2. In the Gaussian case, the norm on $\mathcal{X}^*$ is the one
induced by the "largest" covariance operator in the family of random variables
$\Gamma$. In the bounded-range case, the norm on $\mathcal{X}^*$ is the one
induced by the "largest" covariance operator for random variables satisfying
the range constraint: if $X$ is a real-valued random variable taking values in
an interval $[a, b]$, then $\Psi(\nu)^2 = \tfrac{1}{4} (b - a)^2 \nu^2$ and
$\mathrm{Var}[X] \leq \tfrac{1}{4} (b - a)^2$; this upper bound on the variance
is attained by a Bernoulli random variable with law
$\tfrac{1}{2} \delta_a + \tfrac{1}{2} \delta_b$.

The next result identifies the normal distance that corresponds to the concentra- 
tion of a vector, the entries of which are the empirical (sampled) means of functions 
of independent random variables. 

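Proposition 4.4. For $n = 1, \dots, N$, let
$Z_n := f_n(Y_{n,1}, \dots, Y_{n,K(n)})$ be a real-valued function of
independent random variables $Y_{n,k}$, and suppose that $f_n$ has finite
McDiarmid diameter $\mathcal{D}[f_n]$. Let $Z = (Z_1, \dots, Z_N)$. Suppose
that the random inputs of each $f_n$ are sampled independently $M(n)$ times
according to the distribution $\mathbb{P}$ and that the empirical average

\[ \widehat{\mathbb{E}}[Z] := \left( \frac{1}{M(n)} \sum_{m=1}^{M(n)} f_n\bigl( Y_{n,1}^{(m)}, \dots, Y_{n,K(n)}^{(m)} \bigr) \right)_{n=1}^{N} \tag{4.10} \]

is formed. Then, for any $A \subseteq \mathbb{R}^N$,

\[ \mathbb{P}\bigl[ \widehat{\mathbb{E}}[Z] \in A \bigr] \leq \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[Z], A)^2}{2} \right), \tag{4.11} \]

where the normalizing function $\Psi \colon (\mathbb{R}^N)^* \to [0, +\infty)$
is given in terms of the McDiarmid diameters of the functions
$f_1, \dots, f_N$ and the sample sizes $M(1), \dots, M(N)$:

\[ \Psi(\nu) := \frac{1}{2} \left( \sum_{n=1}^N \frac{\nu_n^2 \, \mathcal{D}[f_n]^2}{M(n)} \right)^{1/2}. \tag{4.12} \]

Proof. Let $\mathbb{H}_{p,\nu} \subseteq \mathbb{R}^N$ be a half-space.
Consider the real-valued random variable
$\langle \nu, \widehat{\mathbb{E}}[Z] \rangle$ as a function of the sampled
input random variables $Y_{n,k}^{(m)}$. Suppose that the McDiarmid subdiameter
of $f_n$ with respect to $Y_{n,k}$ is $\mathcal{D}_{n,k}$. Then the McDiarmid
subdiameter of $\langle \nu, \widehat{\mathbb{E}}[Z] \rangle$ with respect to
the $m$-th sample of $Y_{n,k}$ is $|\nu_n| \mathcal{D}_{n,k} / M(n)$. Hence,
the McDiarmid diameter of $\langle \nu, \widehat{\mathbb{E}}[Z] \rangle$ is

\[ \left( \sum_{n=1}^N \sum_{k=1}^{K(n)} \sum_{m=1}^{M(n)} \frac{\nu_n^2 \mathcal{D}_{n,k}^2}{M(n)^2} \right)^{1/2} = \left( \sum_{n=1}^N \frac{\nu_n^2 \, \mathcal{D}[f_n]^2}{M(n)} \right)^{1/2}. \]

Therefore, since $\widehat{\mathbb{E}}[Z]$ is an unbiased estimator for
$\mathbb{E}[Z]$ (i.e.
$\mathbb{E}\bigl[ \widehat{\mathbb{E}}[Z] \bigr] = \mathbb{E}[Z]$),
McDiarmid's inequality (2.21a) implies that

\begin{align*}
\mathbb{P}\bigl[ \widehat{\mathbb{E}}[Z] \in \mathbb{H}_{p,\nu} \bigr]
& = \mathbb{P}\bigl[ \langle \nu, \widehat{\mathbb{E}}[Z] \rangle \leq \langle \nu, p \rangle \bigr] \\
& \leq \exp\left( - \frac{2 \langle \nu, \mathbb{E}[Z] - p \rangle_+^2}{\sum_{n=1}^N \nu_n^2 \mathcal{D}[f_n]^2 / M(n)} \right) \\
& = \exp\left( - \frac{\langle \nu, \mathbb{E}[Z] - p \rangle_+^2}{2 \cdot \frac{1}{4} \sum_{n=1}^N \nu_n^2 \mathcal{D}[f_n]^2 / M(n)} \right) \\
& = \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[Z], \mathbb{H}_{p,\nu})^2}{2} \right).
\end{align*}

The claim now follows from theorem 1.1. □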

An example of the application of proposition 4.4 is the following: 

Example 4.5 (Functions of empirical means). The Chernoff bounding method can be
used to provide much-improved confidence levels for quantities derived from
many empirical (as opposed to exact) means; see e.g. [25, §5]. Suppose that
$H_0 \colon \mathbb{R}^N \to \mathbb{R}$ is some function of interest: in
particular, the quantity of interest is
$H_0(\mathbb{E}[Z_1], \dots, \mathbb{E}[Z_N])$ for some absolutely integrable
real-valued random variables $Z_1, \dots, Z_N$. If, however, the exact means
$\mathbb{E}[Z_n]$ are unknown, then empirical means
$\widehat{\mathbb{E}}[Z_n]$ may be used in their place if appropriate
confidence corrections are made. Suppose that "error" corresponds to
concluding, based on the empirical means, that $H_0(\mathbb{E}[Z])$ is smaller
than it actually is. Given $a \in \mathbb{R}^N$, set

\[ H_a(z_1, \dots, z_N) := H_0(z_1 + a_1, \dots, z_N + a_N). \tag{4.13} \]

Therefore, given any $\varepsilon > 0$, we seek an appropriate "margin hit"
$a = a(\varepsilon) \in \mathbb{R}^N$ (typically, $a_n > 0$ for each
$n \in \{1, \dots, N\}$) such that

\[ \mathbb{P}\bigl[ H_a(\widehat{\mathbb{E}}[Z_1], \dots, \widehat{\mathbb{E}}[Z_N]) > H_0(\mathbb{E}[Z_1], \dots, \mathbb{E}[Z_N]) \bigr] \geq 1 - \varepsilon. \]

Dually, given $a \in \mathbb{R}^N$, we seek a sharp upper bound on the
probability of error, i.e. on

\[ \mathbb{P}\bigl[ H_a(\widehat{\mathbb{E}}[Z_1], \dots, \widehat{\mathbb{E}}[Z_N]) \leq H_0(\mathbb{E}[Z_1], \dots, \mathbb{E}[Z_N]) \bigr]. \]

If $H_0$ (and hence $H_a$) is monotonic in each of its $N$ arguments and
$Z_1, \dots, Z_N$ are independent, then the probability of error can be bounded
from above (equivalently, the probability of non-error can be bounded from
below) as follows:

\begin{align*}
\mathbb{P}\bigl[ H_a(\widehat{\mathbb{E}}[Z]) \leq H_0(\mathbb{E}[Z]) \bigr]
& = \mathbb{P}\bigl[ H_a(\widehat{\mathbb{E}}[Z]) \leq H_a(\mathbb{E}[Z] - a) \bigr] \\
& \leq 1 - \prod_{n=1}^N \Bigl( 1 - \mathbb{P}\bigl[ \widehat{\mathbb{E}}[Z_n] \leq \mathbb{E}[Z_n] - a_n \bigr] \Bigr) \\
& \leq 1 - \prod_{n=1}^N \left( 1 - \exp\left( - \frac{2 M(n) a_n^2}{\mathcal{D}[f_n]^2} \right) \right).
\end{align*}

Unfortunately, when $N$ is large, the product in the last line of this
inequality is typically close to zero (so that the upper bound on the
probability of error is close to one) unless the sample sizes are very large,
and so this bound is of limited use. Geometrically, this is analogous to the
fact that a high-dimensional orthant (product of half-lines) appears to be very
narrow from the perspective of an observer at its vertex. In contrast,
half-spaces always fill a half of the observer's field of view. To bound the
probability of sublevel or superlevel sets using half-spaces requires $H_a$ to
have some convexity (not monotonicity) properties.

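If $H_a$ is quasiconvex, then the bounds using normal distances can be applied
to good effect, and yield estimates that actually perform better the larger $N$
is. In particular, if $H_a$ is both quasiconvex and differentiable, then the
outward normal to its $\theta$-level set at some point $p$ is just any positive
multiple of the derivative of $H_a$ at $p$, and this yields the bound

\[ \mathbb{P}\bigl[ H_a(\widehat{\mathbb{E}}[Z]) \leq \theta \bigr] \leq \inf_{p : H_a(p) = \theta} \exp\left( - \frac{2 \left( \sum_{n=1}^N \partial_n H_a(p) (\mathbb{E}[Z_n] - p_n) \right)_+^2}{\sum_{n=1}^N (\partial_n H_a(p))^2 \, \mathcal{D}[f_n]^2 / M(n)} \right). \tag{4.14} \]

In particular, taking
$\theta = H_0(\mathbb{E}[Z]) = H_a(\mathbb{E}[Z] - a)$ and evaluating the
exponential in (4.14) at $p = \mathbb{E}[Z] - a \in \mathbb{R}^N$ yields that

\[ \mathbb{P}\bigl[ H_a(\widehat{\mathbb{E}}[Z]) \leq H_0(\mathbb{E}[Z]) \bigr] \leq \exp\left( - \frac{2 \left( \sum_{n=1}^N \partial_n H_a(p) \, a_n \right)_+^2}{\sum_{n=1}^N (\partial_n H_a(p))^2 \, \mathcal{D}[f_n]^2 / M(n)} \right). \tag{4.15} \]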

(4.15) is particularly useful since it links the margin hits $a_n$, the sample
sizes $M(n)$, and the maximum probability of error. For example, given a
desired level of confidence, margin hits $a_n$, and a total number of samples
$M \in \mathbb{N}$, one can choose sample sizes $M(1), \dots, M(N)$ that sum to
$M$ and minimize the right-hand side of (4.15); this yields an optimal
distribution of sampling resources so as to ensure that
$H_a(\widehat{\mathbb{E}}[Z]) > H_0(\mathbb{E}[Z])$ with the desired level of
confidence.
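The sample-allocation problem just described admits a simple closed-form answer if the integrality of the sample sizes is ignored. The Python sketch below assumes that the bound being minimized has the form $\exp\bigl(-2 (\sum_n g_n a_n)_+^2 / \sum_n g_n^2 \mathcal{D}[f_n]^2 / M(n)\bigr)$, where $g_n$ stands for the partial derivative $\partial_n H_a(p)$. Minimizing it over $M(n) > 0$ with $\sum_n M(n) = M$ amounts to minimizing $\sum_n c_n^2 / M(n)$ with $c_n := |g_n| \mathcal{D}[f_n]$, and the Cauchy-Schwarz inequality shows that the minimum is attained when $M(n)$ is proportional to $c_n$. All names and numbers below are hypothetical, and rounding to integer sample sizes is left out.

```python
import math


def error_bound(grad, diam, margin, samples):
    """exp(-2 (sum_n g_n a_n)_+^2 / sum_n g_n^2 D_n^2 / M(n))."""
    numerator = max(sum(g * a for g, a in zip(grad, margin)), 0.0) ** 2
    denominator = sum((g * d) ** 2 / m for g, d, m in zip(grad, diam, samples))
    return math.exp(-2.0 * numerator / denominator)


def proportional_allocation(grad, diam, total):
    """Real-valued sample sizes M(n) proportional to |g_n| * D_n, summing to total."""
    weights = [abs(g) * d for g, d in zip(grad, diam)]
    scale = total / sum(weights)
    return [w * scale for w in weights]


# Hypothetical data: three quantities with different sensitivities and diameters.
grad = [1.0, 2.0, 0.5]
diam = [1.0, 1.0, 2.0]
margin = [0.1, 0.1, 0.1]
total = 300

optimal = proportional_allocation(grad, diam, total)   # [75, 150, 75]
uniform = [total / len(grad)] * len(grad)               # [100, 100, 100]
bound_optimal = error_bound(grad, diam, margin, optimal)
bound_uniform = error_bound(grad, diam, margin, uniform)
```

With these numbers the proportional allocation gives an exponent of about $-4.59$ against about $-4.08$ for the uniform allocation, so the same total sampling budget yields a smaller bound on the probability of error.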

5. High-Dimensional Asymptotics

The topic of this section is the asymptotic sharpness of the bounds introduced 
above as the dimension of the space X becomes large. We begin with a comparison 
of the McDiarmid and half-space bounds for a simple function: a quadratic form 
on R N . 

Example 5.1 (Comparison with McDiarmid's inequality). The following example
serves to illustrate how the half-space method can produce upper bounds on the
measure of suitable sublevel sets that are superior to those offered by
McDiarmid's inequality; it also shows how this effect is more pronounced in
higher-dimensional spaces. Consider the following quadratic form $Q_N$ on
$\mathbb{R}^N$:

\[ Q_N(x) := \tfrac{1}{2} \left\| x - \left( \tfrac{1}{2}, \dots, \tfrac{1}{2} \right) \right\|_2^2. \tag{5.1} \]

For any $\theta > 0$, the sublevel set $Q_N^{-1}([-\infty, \theta])$ is simply
a ball of radius $\sqrt{2\theta}$ about the point
$(\tfrac{1}{2}, \dots, \tfrac{1}{2})$. Suppose that a random vector $X$ takes
values in $[-\tfrac{1}{2}, +\tfrac{1}{2}]^N$ with independent components. Each
McDiarmid subdiameter of $Q_N$ on this cube is $\tfrac{1}{2}$, so
$\mathcal{D}[Q_N]^2 = N/4$ and McDiarmid's inequality (2.21a) implies that

\[ \mathbb{P}[Q_N(X) \leq \theta] \leq \exp\left( - \frac{8 (\mathbb{E}[Q_N(X)] - \theta)_+^2}{N} \right). \]

If also $\mathbb{E}[X] = 0$, then proposition 4.2 implies that

\[ \mathbb{P}[Q_N(X) \leq \theta] \leq \exp\left( - \frac{\bigl( \sqrt{N} - 2\sqrt{2\theta} \bigr)_+^2}{2} \right). \]

For small $N$ and large $\theta$, McDiarmid's bound is the sharper of the two.
However, for small $\theta$ (and, notably, as $N \to \infty$ for any fixed
$\theta$), the half-space bound is the sharper bound. See figure 5.1 for an
illustration.

Figure 5.1. For the quadratic form $Q_N$ on $\mathbb{R}^N$ given in (5.1), a
comparison of the McDiarmid upper bound (squares) and half-space upper bound
(triangles) on $\mathbb{P}[Q_N(X) \leq \theta]$ for two values of $\theta$
(dotted line and hollow polygons; solid line and filled polygons).
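The qualitative comparison in example 5.1 can be reproduced numerically. The Python sketch below is an illustration under explicit assumptions rather than a reproduction of the paper's figure: the quadratic form is taken to be $Q_N(x) = \tfrac{1}{2} \|x - (\tfrac{1}{2}, \dots, \tfrac{1}{2})\|_2^2$ on $[-\tfrac{1}{2}, \tfrac{1}{2}]^N$, so that every McDiarmid subdiameter is $\tfrac{1}{2}$ and McDiarmid's bound reads $\exp(-8 (\mathbb{E}[Q_N(X)] - \theta)_+^2 / N)$; the half-space bound is $\exp(-2 (\sqrt{N}/2 - \sqrt{2\theta})_+^2)$; and, purely as a hypothetical choice of law, the components of $X$ are uniform on $[-\tfrac{1}{2}, \tfrac{1}{2}]$, which gives $\mathbb{E}[Q_N(X)] = N/6$.

```python
import math


def mean_q_uniform(n):
    """E[Q_N(X)] for independent components uniform on [-1/2, 1/2]:
    each term contributes (1/2) * (Var + 1/4) = (1/2) * (1/12 + 1/4) = 1/6."""
    return n / 6.0


def mcdiarmid_bound(n, theta, mean_q):
    """exp(-2 (E[Q] - theta)_+^2 / D^2) with D^2 = N / 4."""
    gap = max(mean_q - theta, 0.0)
    return math.exp(-8.0 * gap * gap / n)


def halfspace_bound(n, theta):
    """exp(-2 d^2), where d = (sqrt(N)/2 - sqrt(2 theta))_+ is the Euclidean
    distance from the mean 0 to the ball of radius sqrt(2 theta)."""
    gap = max(0.5 * math.sqrt(n) - math.sqrt(2.0 * theta), 0.0)
    return math.exp(-2.0 * gap * gap)


theta = 0.5
small = (mcdiarmid_bound(4, theta, mean_q_uniform(4)), halfspace_bound(4, theta))
large = (mcdiarmid_bound(100, theta, mean_q_uniform(100)), halfspace_bound(100, theta))
```

For $N = 4$ the ball touches the mean, so the half-space bound is trivial while McDiarmid's bound is slightly below 1; for $N = 100$ the half-space bound ($e^{-32}$) is far smaller than McDiarmid's (about $e^{-20.9}$).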

The previous example suggests that bounds constructed using the half-space
method may perform very well in high dimension but also that the sharpness of
the bound may depend on "how round" the set whose measure we wish to bound is.
To fix ideas, suppose that
$X = (X_1, \dots, X_N) \colon \Omega \to \mathbb{R}^N$ is a random vector with
independent components, where $X_n$ is supported on an interval of length
$L_n$. For $A \subseteq \mathbb{R}^N$, how sharp is the bound

\[ \mathbb{P}[X \in A] \leq \exp\left( - \frac{d_{\perp,\Psi}(\mathbb{E}[X], A)^2}{2} \right)? \tag{5.2} \]







Figure 5.2. It is not reasonable to expect that (an upper bound for) the
measure of the half-space $\mathbb{H}_{e_1, -e_1}$ is a sharp upper bound for
the measure of the narrow wedge $K_\varepsilon$ when $\varepsilon$ is small.



First, note that since
$d_\perp(\mathbb{E}[X], A) = d_\perp(\mathbb{E}[X], \overline{\mathrm{co}}(A))$,
the bound cannot be expected to be sharp if $A$ differs greatly from its closed
convex hull, and so it makes sense to restrict investigation to the case that
$A = K$, a closed and convex subset of $\mathbb{R}^N$. Secondly, it is not
reasonable to expect the bound (5.2) on $\mathbb{P}[X \in K]$ to be sharp if
$K$ is sharply pointed, e.g. if $K$ is the narrow wedge $K_\varepsilon$ of
angle $\varepsilon \ll 1$ based at $e_1 := (1, 0, \dots, 0)$ in
$\mathbb{R}^N$:

\[ K_\varepsilon := \left\{ x \in \mathbb{R}^N \,\middle|\, \frac{(x - e_1) \cdot e_1}{\|x - e_1\|_2} \geq \cos \varepsilon \right\}; \tag{5.3} \]

see figure 5.2. Therefore, we wish to consider the opposite situation in which K has 
no sharp points, which will be made precise by requiring that K satisfy an interior 
ball condition. 

Suppose that $(p, \nu) \in N^*K$ is such that $d_\perp(x, H_{p,\nu}) = d_\perp(x, K)$. Suppose also
that $\mathbb{B}_r(p - r\omega) \subseteq K$, with $r > 0$ and $\omega \in \mathbb{R}^N$ a unit vector, is an interior ball for $K$
at $p \in \partial K$; cf. figure 5.3. If the law of $X$ on $\mathbb{R}^N$ is highly singular, then it cannot be
expected that the bound (5.2) is sharp, so suppose that the law of $X$ has a density
with respect to Lebesgue measure that is bounded above by some constant $C > 0$.
Then the bound (5.2) is

\[ \mathbb{P}[X \in K] \le \exp\left( - \frac{2 \langle \nu, \mathbb{E}[X] - p \rangle^2}{\sum_{n=1}^N \nu_n^2 L_n^2} \right). \]



In the extreme case, $K$ is precisely the closed ball $\mathbb{B}_r(p - r\omega)$, the $\mathbb{P}$-measure of
which is at most $C r^N \pi^{N/2} / \Gamma(1 + N/2)$.
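The two quantities being compared here can be evaluated numerically. The following Python sketch is illustrative only: the function names are hypothetical, the half-space bound is taken in the form $\exp(-2\langle \nu, \mathbb{E}[X] - p\rangle^2 / \sum_n \nu_n^2 L_n^2)$, and clipping the inner product at zero (so that the bound is trivially 1 when the mean lies in the half-space) is an assumption made for convenience.

```python
import math

def half_space_bound(nu, mean, p, lengths):
    """exp(-2 <nu, mean - p>^2 / sum_n (nu_n L_n)^2) for the half-space H_{p,nu}.

    Assumption: the inner product is clipped at zero, so the bound is 1
    whenever the mean already lies in the half-space.
    """
    gap = max(0.0, sum(v * (m - q) for v, m, q in zip(nu, mean, p)))
    denom = sum((v * L) ** 2 for v, L in zip(nu, lengths))
    return math.exp(-2.0 * gap ** 2 / denom)

def ball_measure_bound(C, r, N):
    """C r^N pi^(N/2) / Gamma(1 + N/2), evaluated through logarithms."""
    log_value = (math.log(C) + N * math.log(r)
                 + 0.5 * N * math.log(math.pi)
                 - math.lgamma(1.0 + 0.5 * N))
    return math.exp(log_value)
```

Working with `math.lgamma` rather than `math.gamma` keeps the volume computation usable for large $N$, where the Gamma function itself would overflow.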

In large deviations theory, the standard notion of asymptotic sharpness is
logarithmic equivalence [ , §1.1]; see also e.g. [8] [28] for surveys of the large deviations
literature. Two sequences $(\alpha_n)_{n \in \mathbb{N}}$ and $(\beta_n)_{n \in \mathbb{N}}$ are said to be logarithmically
equivalent, denoted $\alpha_n \simeq \beta_n$, if

\[ \frac{1}{n} \log \alpha_n - \frac{1}{n} \log \beta_n = \frac{1}{n} \log \left( \frac{\alpha_n}{\beta_n} \right) \to 0 \quad \text{as } n \to \infty. \tag{5.4} \]
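As a small illustration of (5.4), the Python sketch below evaluates the normalised log-ratio for three hypothetical sequences (chosen here for illustration, not taken from the paper): $\alpha_n = n^3 e^{-2n}$ and $\beta_n = e^{-2n}$, which differ only by a polynomial factor and so are logarithmically equivalent, and $\gamma_n = e^{-3n}$, which is not equivalent to $\beta_n$.

```python
import math

def log_ratio_per_n(log_alpha, log_beta, n):
    """(1/n) log(alpha_n / beta_n), computed from logarithms to avoid underflow."""
    return (log_alpha(n) - log_beta(n)) / n

# Hypothetical sequences, given through their logarithms.
log_alpha = lambda n: 3.0 * math.log(n) - 2.0 * n   # alpha_n = n^3 e^{-2n}
log_beta = lambda n: -2.0 * n                       # beta_n  = e^{-2n}
log_gamma = lambda n: -3.0 * n                      # gamma_n = e^{-3n}

for n in (10, 1_000, 1_000_000):
    print(n,
          log_ratio_per_n(log_alpha, log_beta, n),   # tends to 0
          log_ratio_per_n(log_gamma, log_beta, n))   # stays at -1
```

Polynomial prefactors are invisible at this scale, which is why logarithmic equivalence is a coarse notion of sharpness.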

Figure 5.3. An interior ball of radius $r$ for the closed convex
set $K$ at the frontier point $p$. Necessarily, $p$ is a point at which
$\partial K$ is smooth; $K$ admits no interior ball of positive radius at
the vertex $q$. For convenience, the unit vector $\omega \in \mathbb{R}^N$ has
been identified with $\nu \in N_p^*K \subseteq (\mathbb{R}^N)^*$.

Are the half-space bound (5.2) and the measure of $\mathbb{B}_r(p - r\omega)$ logarithmically
equivalent? That is, does the conditional probability $\mathbb{P}[X \in \mathbb{B}_r(p - r\omega) \mid X \in H_{p,\nu}]$,
when raised to the power $\frac{1}{N}$, converge to 1 as $N \to \infty$? To simplify the
asymptotic expansions below, in all lines after the first two, we shall take $\mathbb{E}[X] = 0$ and
$L_1 = \dots = L_N = 1$.

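\[
\begin{aligned}
&\frac{1}{N} \log \mathbb{P}[X \in \mathbb{B}_r(p - r\omega)] - \frac{1}{N} \log (\text{r.h.s. of (5.2)}) \\
&\quad \le \frac{1}{N} \log \frac{C r^N \pi^{N/2}}{\Gamma(1 + N/2)} + \frac{2 \langle \nu, \mathbb{E}[X] - p \rangle^2}{N \sum_{n=1}^N \nu_n^2 L_n^2},
\end{aligned}
\]

which, by Stirling's approximation for the Gamma function [ , p. 256, eq. (6.1.37)],
is approximately

\[
\begin{aligned}
&\frac{2 \langle \nu, p \rangle^2}{N \|\nu\|_2^2} + \frac{\log(C r^N \pi^{N/2})}{N} - \frac{1}{N} \log \left( \sqrt{\frac{2\pi}{1 + N/2}} \left( \frac{1 + N/2}{e} \right)^{1 + N/2} \right) \\
&\quad \approx \frac{2 \langle \nu, p \rangle^2}{N \|\nu\|_2^2} + \frac{\log C}{N} + \log(r \sqrt{\pi}) - \frac{1}{2N} \log \frac{4\pi}{N} - \frac{1 + N/2}{N} \log \frac{N}{2e} \\
&\quad \approx \frac{2 \langle \nu, p \rangle^2}{N \|\nu\|_2^2} + \log r - \log \sqrt{N}.
\end{aligned}
\]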

Note that $|\langle \nu, p \rangle| / \|\nu\|_2 \le \sqrt{N}\, d_{\mathbf{1}}(0, p)$, where $d_{\mathbf{1}}$ denotes the weighted Hamming
distance with weight $w = (1, \dots, 1)$. Therefore, a necessary (but not sufficient)
condition for the half-space bound to be asymptotically sharp in the sense of
logarithmic equivalence is that $r$ is of the same order as $\sqrt{N}$. That is, it is necessary
that $K$ is sufficiently round that it has an interior ball of radius comparable to $\sqrt{N}$
at those frontier points where the normal distance $d_\perp(\mathbb{E}[X], K)$ is attained.
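The role of the scaling $r \sim \sqrt{N}$ can be checked numerically. The Python sketch below (function names are hypothetical) compares the exact per-dimension log-volume $\frac{1}{N} \log\bigl(r^N \pi^{N/2} / \Gamma(1 + N/2)\bigr)$ with the leading Stirling terms $\log r - \log \sqrt{N}$. The additive constant $\frac{1}{2} \log(2 \pi e)$ included in the approximation is the standard bounded remainder from Stirling's formula; it is stated here as an aid to the comparison and does not affect the order-of-magnitude argument.

```python
import math

def log_ball_volume_per_dim(r, N):
    """(1/N) log( r^N pi^(N/2) / Gamma(1 + N/2) ), using lgamma for stability."""
    return math.log(r) + 0.5 * math.log(math.pi) - math.lgamma(1.0 + 0.5 * N) / N

def stirling_leading_terms(r, N):
    """log r - log sqrt(N), plus the bounded constant (1/2) log(2 pi e)."""
    return math.log(r) - 0.5 * math.log(N) + 0.5 * math.log(2.0 * math.pi * math.e)

for N in (10, 100, 10_000):
    fixed = log_ball_volume_per_dim(1.0, N)            # r fixed: decreases like -log sqrt(N)
    scaled = log_ball_volume_per_dim(math.sqrt(N), N)  # r = sqrt(N): remains bounded
    print(N, round(fixed, 4), round(scaled, 4))
```

With $r$ held fixed the per-dimension log-volume drifts to $-\infty$, whereas with $r = \sqrt{N}$ it stays bounded, which is the regime in which the quadratic term from the half-space bound can be balanced.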
Now suppose that $K = f^{-1}([-\infty, \theta])$ is a convex sublevel set for a twice-differentiable
function $f$. Let $\eta_1, \dots, \eta_{N-1}, \nu$ be a basis of $\mathbb{R}^N$ such that

\[ \|\eta_1\|_2 = \dots = \|\eta_{N-1}\|_2 = \|\nu\|_2 = 1 \]

and, for each $n \in \{1, \dots, N-1\}$, $\eta_n$ is perpendicular to $\nu$. Suppose that, in this
system of normal coordinates, near $p$, the frontier of $K$ can be approximated by a
parabola:

\[ \partial K = \left\{ y_1 \eta_1 + \dots + y_{N-1} \eta_{N-1} - y_N \nu \,\middle|\, y_N = \sum_{n=1}^{N-1} \lambda_n y_n^2 \right\} \]



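with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{N-1} \ge 0$. Then the condition that $K$ admits an interior ball
of radius $r$ at $p$ is the inequality

\[ r - \sqrt{r^2 - \sum_{n=1}^{N-1} y_n^2} \ge \sum_{n=1}^{N-1} \lambda_n y_n^2 \quad \text{whenever} \quad \sum_{n=1}^{N-1} y_n^2 \le r^2. \]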



This, in turn, leads to the following condition on $\lambda_1$: it must hold that $\lambda_1 \le \frac{1}{2r}$. Put
another way, the half-space method cannot be expected to provide asymptotically
sharp bounds for $\mathbb{P}[f(X) \le \theta]$ if, when $f$ is approximated in normal coordinates
near the closest point of $f^{-1}([-\infty, \theta])$ to $\mathbb{E}[X]$ by a non-negative quadratic form,
that quadratic form has an eigenvalue greater than $(4N)^{-1/2}$.
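The curvature threshold can be verified numerically. The following Python sketch (the function name and grid resolution are illustrative assumptions) tests the interior ball inequality $r - \sqrt{r^2 - t^2} \ge \lambda_1 t^2$ for $t \in (0, r]$ along the direction of largest curvature, which is the most restrictive direction because, for fixed $\sum_n y_n^2 = t^2$, the sum $\sum_n \lambda_n y_n^2$ is largest there.

```python
import math

def admits_interior_ball(lam_max, r, samples=1000):
    """Check r - sqrt(r^2 - t^2) >= lam_max * t^2 on a grid of t in (0, r]."""
    for k in range(1, samples + 1):
        t = r * k / samples
        # Algebraically equal to r - sqrt(r^2 - t^2), but free of cancellation
        # for small t; max(0, .) guards against rounding at t = r.
        depth = t * t / (r + math.sqrt(max(0.0, r * r - t * t)))
        if depth < lam_max * t * t:
            return False
    return True

r = 10.0
print(admits_interior_ball(0.9 / (2.0 * r), r))  # curvature below 1/(2r)
print(admits_interior_ball(1.1 / (2.0 * r), r))  # curvature above 1/(2r)
```

The inequality is tightest as $t \to 0$, where $r - \sqrt{r^2 - t^2} \approx t^2 / (2r)$; with $r = \sqrt{N}$ the threshold $1/(2r)$ becomes $(4N)^{-1/2}$.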

6. Appendix: Chernoff Bounds 

The method of Chernoff bounds [5, §7.4.3] [6] is a simple one in which the 
probability of a subset of X is bounded by that of a containing half-space, and the 
probability of that half-space is bounded using the moment-generating function of 
the probability measure. 



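Lemma 6.1 (Chernoff bounds). For any half-space $H_{p,\nu} \subseteq \mathcal{X}$,

\[ \mathbb{P}[X \in H_{p,\nu}] \le \inf_{s > 0} e^{s \langle \nu, p \rangle} M_X(-s\nu). \tag{6.1} \]

For any convex set $K \subseteq \mathcal{X}$,

\begin{align}
\mathbb{P}[X \in K] &\le \inf_{(p,\nu) \in N^*K} e^{\langle \nu, p \rangle} M_X(-\nu) \tag{6.2a} \\
&= \exp \Bigl( - \sup_{p \in K} (\Lambda_X + \chi_{-N_p^*K})^*(p) \Bigr). \tag{6.2b}
\end{align}

In particular, for any $x \in \mathcal{X}$,

\[ \mathbb{P}[X = x] \le \exp(-\Lambda_X^*(x)). \tag{6.3} \]

Proof. By the definition of the half-space $H_{p,\nu}$,

\begin{align*}
\mathbb{P}[X \in H_{p,\nu}] &= \mathbb{P}[\langle \nu, X \rangle \le \langle \nu, p \rangle] \\
&= \mathbb{E}\bigl[ \mathbb{1}[\langle \nu, p - X \rangle \ge 0] \bigr] \\
&\le \mathbb{E}\bigl[ e^{s \langle \nu, p - X \rangle} \bigr] \quad \text{for any } s > 0, \\
&= e^{s \langle \nu, p \rangle} M_X(-s\nu).
\end{align*}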
Since this inequality holds for any $s > 0$, taking the infimum over all such $s$ yields
(6.1). Recall that the outward normal cone to a convex set at any point is closed
under multiplication by non-negative scalars; hence, for any convex set $K \subseteq \mathcal{X}$,
taking the infimum of the right-hand side of (6.1) over half-spaces $H_{p,\nu}$ that contain
$K$ yields (6.2a). Now observe that

\begin{align*}
\inf_{(p,\nu) \in N^*K} e^{\langle \nu, p \rangle} M_X(-\nu)
&= \inf_{(p,\nu) \in N^*K} \exp\bigl( \langle \nu, p \rangle + \Lambda_X(-\nu) \bigr) \\
&= \exp \Bigl( \inf_{p \in K} \inf_{\nu \in N_p^*K} \bigl( \langle \nu, p \rangle + \Lambda_X(-\nu) \bigr) \Bigr) \\
&= \exp \Bigl( - \sup_{p \in K} \sup_{\nu \in -N_p^*K} \bigl( \langle \nu, p \rangle - \Lambda_X(\nu) \bigr) \Bigr) \\
&= \exp \Bigl( - \sup_{p \in K} (\Lambda_X + \chi_{-N_p^*K})^*(p) \Bigr),
\end{align*}

which establishes (6.2b); (6.3) follows as a special case. □
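As a concrete instance of (6.1), the Python sketch below treats the special case, assumed here purely for illustration, in which $X$ is a standard Gaussian vector in $\mathbb{R}^N$, so that $\log M_X(\ell) = \frac{1}{2} \|\ell\|_2^2$. The infimum over $s > 0$ is approximated by a crude grid search in the exponent (a stand-in for the exact minimisation), and the result is compared with the exact half-space probability. All function names are hypothetical.

```python
import math

def gaussian_cgf(ell):
    """log M_X(ell) = |ell|^2 / 2 for a standard Gaussian vector (assumed model)."""
    return 0.5 * sum(c * c for c in ell)

def chernoff_half_space(nu, p, cgf, s_max=10.0, steps=10_000):
    """Grid approximation of inf_{s>0} exp(s <nu,p>) M_X(-s nu), as in (6.1)."""
    inner = sum(v * q for v, q in zip(nu, p))
    best_exponent = 0.0  # limiting value as s -> 0+, i.e. the trivial bound 1
    for k in range(1, steps + 1):
        s = s_max * k / steps
        exponent = s * inner + cgf([-s * v for v in nu])
        best_exponent = min(best_exponent, exponent)
    return math.exp(best_exponent)

def gaussian_half_space_probability(nu, p):
    """Exact P[<nu,X> <= <nu,p>] for a standard Gaussian vector X."""
    z = sum(v * q for v, q in zip(nu, p)) / math.sqrt(sum(v * v for v in nu))
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

nu, p = [1.0, 0.0, 0.0], [-2.0, 0.0, 0.0]
print(chernoff_half_space(nu, p, gaussian_cgf), gaussian_half_space_probability(nu, p))
```

Minimising the exponent rather than the bound itself avoids overflow in the moment-generating function for large $s$. In this Gaussian case the minimiser is $s = -\langle \nu, p \rangle / \|\nu\|_2^2$, giving the familiar bound $\exp(-t^2/2)$ for a half-space at distance $t$ from the mean.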